Introduction

A little history about Enron company

Enron is a natural-gas-transmission company founded in 1985 in the US. In 1990’s the US congress adopt a series of law to deregulate the sale of natural gas. This makes Enron loosing it’s exclusivity right on the natural gas pipeline. At this time Jeffrey Skilling, who was initially a consultant and later became the company’s chief operating officer, transformed Enron into a trader energy derivative to be an intermediary between natural-gas producers and their customers. Soon after that, Enron become a leader in this market and makes huge profit on its trade. This golden age for the company allow them to recruit Andrew Fastow who quickly became the chief financial officer. Moreover, they diversify their activity to include electricity, coal, paper, and steel. Perhaps, success have is limit and in late 90’s the company profit start to shrank… The pressure from shareholders, company executives began to rely on dubious accounting practices. Especially they used the “market-tomarket accounting” which allowed the company to write unrealized future gain from some trading contract into current income statement, thus giving the illusion of higher current profits. In August, 2001 some people at the head of the company start to worry about a possible accounting scandals due to this practice. In October, 2001 the Securities and Exchange Commission began investigating the transactions of Enron. This was the starting event who lead the company to the bankruptcy which really start in December, 2001.

Source Britannica Enron scandal.

Project aims

The principal aim of this project is to explore the Enron’s email data set for extracting insight about the fiscal fraud investigation and bankruptcy of the company in 2001. For that have 3 data sets:

  • the employee list with their email address

  • the emails exchange from 1999 to 2002

  • the recipients of each emails (to, cc, bcc).

The different insight will are available into a shiny apps.

For that project we used several libraries listed here: For data exploration, analysis and visualization:

To display the result into the Rmarkdown report:

To create the shiny apps:

Data exploring and cleaning

First look at the data

The aim of this part is to see :

  • which kind of data the different table contains

  • the existence of missing value and how to handle them

employee dataset

Description of the data set variables and dimension:

##       eid          firstName           lastName           Email_id        
##  Min.   :  1.00   Length:149         Length:149         Length:149        
##  1st Qu.: 38.00   Class :character   Class :character   Class :character  
##  Median : 75.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 75.07                                                           
##  3rd Qu.:112.00                                                           
##  Max.   :150.00                                                           
##                                                                           
##     Email2             Email3             EMail4             folder         
##  Length:149         Length:149         Length:149         Length:149        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##             status  
##  Employee      :41  
##  N/A           :31  
##  Vice President:23  
##  Director      :14  
##  Manager       :14  
##  (Other)       :25  
##  NA's          : 1

This data set contain 149 rows and 9 columns.

This data set contains employee ID (eid), the first and last name of the employee as well as their status, the email addresses for each employee, and the folder where their email are stored. In the status variable there exist missing value’s identify by R (NA) but also putting directly in the data by the set owner which are write N/A. The eid variable is identify has type numeric, status is associate with a factor type and the other variable are character type.

Display of some observations in the data frame:

eid firstName lastName Email_id Email2 Email3 EMail4 folder status
13 Marie Heard heard-m NA
6 Mark Taylor taylor-m Employee
19 Lindy Donoho donoho-l Employee
115 Lisa Gang gang-l N/A
129 Jeffrey Skilling skilling-j CEO
18 Lynn Blair blair-l Director
33 Kim Ward ward-k N/A
149 Kate Symes symes-k Employee
52 Kay Mann mann-k Employee
21 Keith Holst holst-k Director

By looking at the head of the data, we observed that eid is associate to numeric data type but the more adapted type seems to be factor because it is an ID for employee. In addition, the variables Email2, Email3, EMail4 contain a lot of blank.

To investigate the blank we temporary change the datatype of those variables from character to factor to see what kind of result we return for the blank observation.

employeelist %>% transform(
  Email2 = as.factor(Email2),
  Email3 = as.factor(Email3),
  EMail4 = as.factor(EMail4)
) %>% summary()
##       eid          firstName           lastName           Email_id        
##  Min.   :  1.00   Length:149         Length:149         Length:149        
##  1st Qu.: 38.00   Class :character   Class :character   Class :character  
##  Median : 75.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 75.07                                                           
##  3rd Qu.:112.00                                                           
##  Max.   :150.00                                                           
##                                                                           
##                     Email2                        Email3   
##                        :52                           :100  
##  a..shankman@enron.com : 1   a..martin@enron.com     :  1  
##  andrew.lewis@enron.com: 1   andrew.h.lewis@enron.com:  1  
##  azipper@enron.com     : 1   c.germany@enron.com     :  1  
##  b..sanders@enron.com  : 1   carol.stclair@enron.com :  1  
##  barbo@enron.com       : 1   dana_davis@enron.com    :  1  
##  (Other)               :92   (Other)                 : 44  
##                       EMail4       folder                     status  
##                          :147   Length:149         Employee      :41  
##  j..kean@enron.com       :  1   Class :character   N/A           :31  
##  peter.f.keavey@enron.com:  1   Mode  :character   Vice President:23  
##                                                    Director      :14  
##                                                    Manager       :14  
##                                                    (Other)       :25  
##                                                    NA's          : 1

We can see that, in the Email2, Email3, and EMail4 variable don’t have missing value but they are blank character. In the Email3 and EMail4 more than the half of the value are blank, maybe those variable aren’t very helpful for the analysis. In the variable status the NA are differently declared where we have 31 values with N/A and only 1 NA. For that variable we will need to replace the N/A by real NA values to homogenized the data.

message data set

Description of the data set variables and dimension:

##       mid                               sender            date           
##  Min.   :    52   jeff.dasovich@enron.com  :  6273   Min.   :0001-05-30  
##  1st Qu.: 88565   j.kaminski@enron.com     :  5838   1st Qu.:2000-12-01  
##  Median :186421   kay.mann@enron.com       :  5100   Median :2001-05-21  
##  Mean   :190260   sara.shackleton@enron.com:  4797   Mean   :1999-04-15  
##  3rd Qu.:279962   tana.jones@enron.com     :  4437   3rd Qu.:2001-10-25  
##  Max.   :404927   chris.germany@enron.com  :  3686   Max.   :2044-01-04  
##                   (Other)                  :222628                       
##                                          message_id       subject         
##  <10000282.1075847198841.JavaMail.evans@thyme>:     1   Length:252759     
##  <10000478.1075841161605.JavaMail.evans@thyme>:     1   Class :character  
##  <1000097.1075860055721.JavaMail.evans@thyme> :     1   Mode  :character  
##  <1000099.1075858574579.JavaMail.evans@thyme> :     1                     
##  <1000115.1075852075775.JavaMail.evans@thyme> :     1                     
##  <1000122.1075858816233.JavaMail.evans@thyme> :     1                     
##  (Other)                                      :252753

This data set contain 252759 rows and 5 columns.

Here we observed that, the mid and date variables identify as a numeric, the variables sender and message_id are attached to factor data type, and the variable subject is character data type.

Display of some observations in the data frame:

mid sender date message_id subject
52 2000-01-21 ENRON HOSTS ANNUAL ANALYST CONFERENCE PROVIDES BUSINESS OVERVIEW AND GOALS FOR 2000
53 2000-01-24 Over $50 – You made it happen!
54 2000-01-24 Over $50 – You made it happen!
55 2000-02-02 ROAD-SHOW.COM Q4i.COM CHOOSE ENRON TO DELIVER FINANCIAL WEB CONTENT
56 2000-02-07 Fortune Most Admired Ranking
57 2000-08-25 WPTF Friday Credo Veritas Burrito
58 2000-06-21 SAP ID - Here it is!!!!!
59 2000-06-27 Set of Graphs
60 2000-07-25 Block Forward Financial Trades
61 2000-07-27 Block forwards

By looking at the head of the data we observed that, the mid don’t look like numeric data but more has identifier like the eid variable in the employeelist table. In the data frame the date variable is associate to a date type. More over it seems that the observation in the subject variable are repeat several time suggesting they aren’t individual string but more a categorical variable.

Because the description seems to treat the variable date as a numeric type but the observation look like real date in the data display above we check with the class() function if R treat it correctly by evaluating if his data type is Date:

class(message$date) == "Date"
## [1] TRUE

The result confirm us R treat the date variable in the good data type meaning Date type. For this variable it is not necessary to adapt the data type.

In the date variable the min and max values return are strange date. In the introduction we saw that the data cover the period between 1999 and 2002 and those value aren’t in that period.

To understand what is those values we filter the table to get the year is less than 1999 or more than 2002:

message %>% 
  select(date) %>% #keep the date variable
  mutate(year = format(date,"%Y")) %>% #extract the year from the date
  filter((year < 1999) | (year > 2002)) %>% #keep the value below and after the study's period
  group_by(year) %>% count() #count the number of rows per date out of the study's period
## # A tibble: 10 × 2
## # Groups:   year [10]
##    year      n
##    <chr> <int>
##  1 0001    205
##  2 0002     53
##  3 1979      6
##  4 1997      1
##  5 1998     85
##  6 2004     53
##  7 2007      1
##  8 2020      2
##  9 2043      1
## 10 2044      3

In filtering the strange date we can see that some aren’t date (0001, 0002) and the other are out of the study’s period. This represent average 450 values which makes less than 1% of the observations in the table.

The variable mid and message_id could be redundancy. To verify that we will count the number of distinct value for both variable to see if a mid could be attached to several message_id.

message%>% select(mid, message_id) %>% #select only the variable we need.
  transform(mid = as.factor(mid)) %>% #transform the mid into factor data type.
  group_by(message_id) %>% 
  count(mid) %>% #count the number of mid per message_id group, create a n variable with the result.
  filter(n != 1) #filter to get the rows with a value different than 1.
## # A tibble: 0 × 3
## # Groups:   message_id [0]
## # ℹ 3 variables: message_id <fct>, mid <fct>, n <int>

This shown that, each message_id is attached to one and only one mid and confirm to us the redundancy of the 2 variables in the data frame. To lighten the data we can choose one of them to be kept in the dataframe for the analysis.

recipient info data set

Description of the data set variables and dimension:

##       rid               mid         rtype        
##  Min.   :     67   Min.   :    52   BCC: 253713  
##  1st Qu.: 718289   1st Qu.:105438   CC : 253735  
##  Median :1515296   Median :198263   TO :1556994  
##  Mean   :1543862   Mean   :196168                
##  3rd Qu.:2309682   3rd Qu.:280673                
##  Max.   :3242063   Max.   :404927                
##                                                  
##                        rvalue       
##  no.address@enron.com     :  19198  
##  jeff.dasovich@enron.com  :  11137  
##  richard.shapiro@enron.com:  11015  
##  steven.j.kean@enron.com  :  10873  
##  james.d.steffes@enron.com:  10615  
##  tana.jones@enron.com     :   9781  
##  (Other)                  :1991823

This data set contain 2064442 rows and 4 columns. The summary of the data reveal that, the rid and mid are consider as numeric variable by R and the variables rtype and rvalue are consider as factor data type.

Display of some observations in the data frame:

rid mid rtype rvalue
67 52 TO
68 53 TO
69 54 TO
70 55 TO
71 56 TO
72 56 TO
73 57 TO
74 58 TO
75 59 TO
76 60 TO

By looking at the head of this dataset we can see that rid and mid are identifier, with the result return by the summary function we need to transform those variables into factor data for having in the good type. Also, the mid variable is a foreign key allowed to link this table with the message table. Binding together this 2 table will allow us to have the sender and the receiver of the email as well as which type of receiver (direct with the to or “indirect” with the CC and BCC). The last variable rvalue is the email address of the receiver which can be general (e.g., , see in the head of the table) or specific to a person (e.g., , see as the top specific receiver in the summary of that table).

reference info data set

Description of the data set variables and dimension:

##       rfid            mid          reference        
##  Min.   :    2   Min.   :    79   Length:54778      
##  1st Qu.:14305   1st Qu.: 60580   Class :character  
##  Median :30987   Median :178176   Mode  :character  
##  Mean   :30860   Mean   :179738                     
##  3rd Qu.:46728   3rd Qu.:275557                     
##  Max.   :63024   Max.   :404920

This data set contain 54778 rows and 3 columns.

the summary pointed that, the variable rfid and mid are qualified as numeric type and the reference variable as a character type.

Display of some observations in the data frame:

rfid mid reference
5 14 845 From: Monaco, John [EM] [mailto:john.monaco@citi.com]Sent: Thursday, March 07, 2002 6:40 AMTo: Badeer, RobertSubject: FW: RE: Whats up!!!!!Still around!!!!—–Original Message—–From: [mailto:enron.mailsweeper.admin@enron.com] Sent: Thursday, March 07, 2002 9:36 AMTo: Monaco, John [EM]Subject: RE:RE: Whats up!!!!!The enron.com recipient(s) moved to a new organization. The new email address follows the (as per their original enron.comemail address). Email sent to recipient(s) at enron.com will not bedelivered.
6 15 846 From: Rangel, Ina Sent: Thursday, March 07, 2002 8:11 AMTo: Badeer, RobertSubject: Expense ReceiptsBob:I received your expense receipts today. Will submit them today.Ina Rangel
7 16 847 From: Grigsby, Mike Sent: Friday, March 08, 2002 9:08 AMTo: Badeer, RobertSubject: RE: BADGEGo with Ina —–Original Message—–From: Badeer, Robert Sent: Friday, March 08, 2002 11:08 AMTo: Grigsby, MikeSubject: RE: BADGEGrigs, Ina said it would be on the 5th floor of the new building. Which is right? —–Original Message—–From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
8 17 848 From: Grigsby, Mike Sent: Friday, March 08, 2002 6:46 AMTo: Badeer, RobertSubject: BADGEYour badge will be waiting for you at the front desk in the north tower on mon. if not, then call and we will retrieve you.Michael D. Grigsby, Executive DirectorUBS Warburg Energy, LLCWork: 713-853-7031Mobile: 713-408-6256
9 18 849 From: Rangel, Ina Sent: Thursday, March 07, 2002 12:56 PMTo: Badeer, RobertSubject: FW: Badge AccessWhen you get here on Monday morning, come to the 5th floor reception of the new building. If your badge is not there, then I will come and pick you up when you get here and bring you up. Your badge will be ready Monday for sure, whether it be morning or afternoon I am not sure of.-Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:50 PMTo: Rangel, InaSubject: RE: Badge AccessIna,We can most likely have this by Monday morning and he can pick this up at the 5th floor reception. If he has any problems he can call me. Thanks!Mandy —–Original Message—–From: Rangel, Ina Sent: Thursday, March 07, 2002 2:39 PMTo: Curless, AmandaSubject: RE: Badge Access << File: Badge Access Form.doc >> I filled out all of the information that I had on him. Will he be able to have his badge by Monday morning and where will he go to pick it up.Ina —–Original Message—–From: Curless, Amanda Sent: Thursday, March 07, 2002 2:00 PMTo: Rangel, InaSubject: Badge Access << File: Badge Access Form.doc >> Ina,Pleae fill out and return to me at ECS 05848. You can e-mail this to me if this is easier. Thanks!Mandy
10 19 851 From: Hyatt, Kevin Sent: Wednesday, July 25, 2001 1:00 PMTo: Nielsen, JeffSubject: RE: Mid 4 to Mid 3 QuoteJeff, can you fill in the rates for the 5,7, and 10 year terms for me. These would be notional of course. Let me know if you have questions.thxKevin 713-853-5559 Term/yrs. 2 5 7 10 Demand: Firm* $.02 - .03 $.04-.05 $.06-.07 $.07-.08 TI $.035 - .045 \(.065-\).075 $.075-.085 $.095-.105 Volume is min. 0 to max of 200,000/d * plus minimum commodity Primary to El Paso Waha would be slightly higher Rates are plus fuel —–Original Message—–From: Nielsen, Jeff Sent: Monday, July 23, 2001 4:39 PMTo: Hyatt, KevinSubject: Mid 4 to Mid 3 QuoteKevin,Jo Williams said that you needed a quote for transportation from Mid 4 to Mid 3 in the Waha area. On a firm basis we would be would in the $.02 to $.03 demand range plus minimum commodity. For a TI rate use between $.035 and $.045. If you would like primary to El Paso Waha, that rate would be a little higher. We have been able to get additional value out of that interconnect because of the gas prices in California. Please let me know if you need any additional information.Jeff 402-398-7434

By looking at the head of that table we can see that:

  • the rfid and mid aren’t numeric variable but look like identifier. It will be necessary to change their data type for factor for it be better adapted.

  • the reference variable describe the content of each message.

  • the existence here of the mid variable allow us to merge that table with the message and/or the recipientinfo table.

  • in the message and recipientinfo table we have email address like in the employeelist info. We could thinks that, this table can be merged through this.

By exploring those data set we identify some issues needed to be handle before the analysis such as data type change, missing values handling, variable redundancy, and data set merging.

We choose to :

  • change the data type of the identifier variable in the different table from numeric to factor.

  • change the data type of the subject variable from character to factor.

  • withdraw the message_id variable in the message table to lighten the dataset. In addition we drop the lines for which the date aren’t in the study’s period (from 1999 to 2002) and the strange date.

  • for the moment, withdraw the variable Email3 and EMail4 variable in the employeelist table because they contain al lot of blank value and every employee have one Email id filled. We choose to keep for the moment the Email2 variable because it contain only 52 blank observation over the 149 which makes average 25% of the data.

  • don’t keep the referenceinfo table because it contain only 54,778 observation which makes only 2% of the recipientinfo table. This suggest that, this table isn’t exhuastive and adding those info in a merging table with recipientinfo will create a lot of missing values.

  • Creates a table which bind all the information about the message by merging together the table message and recipientinfo through the mid foreign key.

Data engineering and cleaning

Employeelist table

employeelist_2 <- employeelist %>% 
  select(-c(Email3, EMail4)) %>% #the variable we don't need in the data
  transform(eid = as.factor(eid)) %>% #data type change for the variable eid to factor
  mutate(status = if_else((status == "N/A"), NA, status)) #homogenized the declaration of the NA in the variable status

Description of the new table employee list:

summary(employeelist_2)
##       eid       firstName           lastName           Email_id        
##  1      :  1   Length:149         Length:149         Length:149        
##  2      :  1   Class :character   Class :character   Class :character  
##  3      :  1   Mode  :character   Mode  :character   Mode  :character  
##  4      :  1                                                           
##  5      :  1                                                           
##  6      :  1                                                           
##  (Other):143                                                           
##     Email2             folder                     status  
##  Length:149         Length:149         Employee      :41  
##  Class :character   Class :character   Vice President:23  
##  Mode  :character   Mode  :character   Director      :14  
##                                        Manager       :14  
##                                        Trader        :13  
##                                        (Other)       :12  
##                                        NA's          :32

Verification of the data type of the table variables:

#return the data type for every variable in the table
str(employeelist_2)
## 'data.frame':    149 obs. of  7 variables:
##  $ eid      : Factor w/ 149 levels "1","2","3","4",..: 13 6 19 115 129 18 33 148 52 21 ...
##  $ firstName: chr  "Marie" "Mark" "Lindy" "Lisa" ...
##  $ lastName : chr  "Heard" "Taylor" "Donoho" "Gang" ...
##  $ Email_id : chr  "marie.heard@enron.com" "mark.e.taylor@enron.com" "lindy.donoho@enron.com" "lisa.gang@enron.com" ...
##  $ Email2   : chr  "" "mark.taylor@enron.com" "ldonoho@enron.com" "" ...
##  $ folder   : chr  "heard-m" "taylor-m" "donoho-l" "gang-l" ...
##  $ status   : Factor w/ 10 levels "CEO","Director",..: NA 3 3 NA 1 2 NA 3 3 2 ...

The result from summary and the str function show us the data type change, the NA homogenized, and the suppression of the variable is done correctly. We can now used this table to pursue the analysis.

message table

message_2 <- message %>%
  select(-c(message_id)) %>% #withdraw the variable we don't need
  transform(#change the data type for factor
    mid = as.factor(mid),
    sender = as.factor(sender),
    subject = as.factor(subject)) %>%
  #add the year variable in the table from the date
  mutate(year = as.factor(format(date, "%Y"))) %>% 
  #filter to keep only the date from 1999 to 2002
  filter(year %in% c(1999 : 2002)) %>% #drop the year variable which is no more useful in the data
  select(-year)

recipientinfo

recipientinfo_2 <- recipientinfo %>%
  #change the variable data type for factor
  transform(rid = as.factor(rid),
            rvalue = as.factor(rvalue),
    mid = as.factor(mid))

referenceinfo

referenceinfo_2 <- referenceinfo %>%
  #change the variable data type for factor
  transform(rfid = as.factor(rfid),
    mid = as.factor(mid))

Data analysis

#in this part we will draw many plot, every will have the same theme
theme_set(theme_light())

the employee liste

Number of employee per status :

employeelist_2 %>% select(status) %>% #select the needed variable
  group_by(status) %>% count() %>% #count the number of employee per status
  ungroup() %>%
  #calculate the percentage for each status
  mutate(perc = `n`/sum(`n`),
  labels = scales::percent(perc)) %>%
  #bar chart
  ggplot(aes(reorder(status, perc ,sum),perc, fill = status)) +
  geom_bar(stat = "identity") +
  #to invert the axis's position
  coord_flip()+ 
  #customize the theme, title and axis labels
  geom_text(aes(label = labels), vjust = 0.5, size = 4) + #display the percentage for each category at the end of the corresponding bar
  scale_y_continuous(labels = scales::percent_format())+
  ggtitle("Number of employee per status in Enron company")+
  labs(y = "Percentage (%)",x = "Employee status") +
  scale_fill_brewer(palette = "Set3", 
                    #to display the NA in grey on the graph
                    na.value = "grey50")+
  theme(legend.position = "none")

The above bar chart shows us that:

  • most of the employee have an employee or unknown status (respectively 27.48% and 21.48%)

  • they have few lawyer (less than 1% of the total number of employee)

  • surprisingly a lot of employee have a vice president status (average 15% of the employee)

  • it has a similar number of manager, director, and trader in the company (average 9% for each)

  • at the head of the company it has several CEO, President, and managing director (average 2% for each)

!!!List of thing need to be achieved !!!

=> merge the employee list with the message DF through the email address

=> for the analysis on employeelist table: - count the number of email address for each employee

=> for the message_df: - look at the number of email per email address - look at the type of email (TO, CC, BC) - look at the number of eamil exchange between month and year -> see if it as significant change around the historic date

=> for the referenceinfo make an analysis of the message content or do it with the subject from recipientinfo table where is more exhaustive. With the subject makes a classification of the potential different type of email send/receive.

=> make a radar chart to display the status and the number of email send/recieve by each. To do it follow this tutorial: https://r-graph-gallery.com/web-circular-barplot-with-R-and-ggplot2.html

=> pursue with the analytics question in the Exam project rules.

=> See if it should be useful to separate the date variable into month and year.

=> see to clean a little the environment to keep in memory only the needed table.